Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Counts the number of lines correctly for files with certain multibyte encodings #1211

Merged
merged 3 commits into from
May 22, 2014

Conversation

alindeman
Copy link
Contributor

Currently, loc and sloc are incorrect for UTF-16 encoded files with Windows line endings (example). This is because data and the newline regular expression are encoded as ASCII-8BIT or UTF-8, even if the encoding is detected as something else. In the case of UTF-16LE, Windows newlines are encoded as "\r\0\n\0". Because of the \0 separating the \r and \n, the current regular expression (\r|\n) counts two lines even though it should only be interpreted as one single line break.

The symptom is that UTF-16LE encoded files with Windows line endings are rendered with whitespace at the bottom that doesn't actually exist because the line count is calculated as twice what it should be.

/cc: @github/encodings

@jdennes
Copy link
Contributor

jdennes commented May 21, 2014

Thanks for the quick fix! ⚡

@alindeman
Copy link
Contributor Author

Thanks for the quick fix! ⚡

np, was a fun one to track down. I'd still like some 👀 and a 👍 from @github/encodings before I try this out.

@brianmario
Copy link
Contributor

nice catch! 👍

@alindeman alindeman mentioned this pull request May 21, 2014
alindeman added 2 commits May 21, 2014 13:36
It looks like it's valid to call this method even if `binary?` is true.
Encoding as 'ASCII-8BIT' should always succeed.
@alindeman
Copy link
Contributor Author

Awkwardly enough, we've made an implicit guarantee that the data exposed through lines is encoded in its original form (in practice, ASCII-8BIT or UTF-8) even if the file is detected as a different encoding. Certain other code seems to actually depend on this.

For now, I'll make a change forcing each line back to the original encoding unless someone has a better idea.

encoded_newlines = ["\r\n", "\r", "\n"].
map { |nl| nl.encode(encoding).force_encoding(data.encoding) }

data.split(Regexp.union(encoded_newlines), -1)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New strategy that keeps each entry in lines the same encoding as data, which (for GitHub blobs) is ASCII-8BIT. Seem OK, @github/encodings?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hot, I like that

alindeman added a commit that referenced this pull request May 22, 2014
Counts the number of lines correctly for files with certain multibyte encodings
@alindeman alindeman merged commit 6a192da into github-linguist:master May 22, 2014
@alindeman alindeman deleted the multibyte_line_count branch May 22, 2014 15:27
smola added a commit to smola/linguist that referenced this pull request Dec 1, 2019
Some files failed with "invalid byte sequence in UTF-8 (ArgumentError)"
when BlobHelper#lines was called. Some problematic files include
UTF-16LE samples such as test/fixtures/Data/utf16le.

Errors were not present when computing stats from git repositories,
since git blobs are always read as ASCII-8BIT and that was working
correctly. However, when using FileBlob, encoding could be ASCII-8BIT,
UTF-8 or other, depending on the runtime value of Encoding.external_encoding.

Tests were not catching the error since they were forcing
Encoding.external_encoding to be ASCII-8BIT (introduced in github-linguist#1211). So the
error would only be seen in wild usage (see issue github-linguist#353).

This commit forces ASCII-8BIT on File.read calls. The error is still
present if using memory blobs with other encodings.
lildude pushed a commit that referenced this pull request Mar 16, 2020
Some files failed with "invalid byte sequence in UTF-8 (ArgumentError)"
when BlobHelper#lines was called. Some problematic files include
UTF-16LE samples such as test/fixtures/Data/utf16le.

Errors were not present when computing stats from git repositories,
since git blobs are always read as ASCII-8BIT and that was working
correctly. However, when using FileBlob, encoding could be ASCII-8BIT,
UTF-8 or other, depending on the runtime value of Encoding.external_encoding.

Tests were not catching the error since they were forcing
Encoding.external_encoding to be ASCII-8BIT (introduced in #1211). So the
error would only be seen in wild usage (see issue #353).

This commit forces ASCII-8BIT on File.read calls. The error is still
present if using memory blobs with other encodings.
lildude pushed a commit that referenced this pull request Mar 16, 2020
Some files failed with "invalid byte sequence in UTF-8 (ArgumentError)"
when BlobHelper#lines was called. Some problematic files include
UTF-16LE samples such as test/fixtures/Data/utf16le.

Errors were not present when computing stats from git repositories,
since git blobs are always read as ASCII-8BIT and that was working
correctly. However, when using FileBlob, encoding could be ASCII-8BIT,
UTF-8 or other, depending on the runtime value of Encoding.external_encoding.

Tests were not catching the error since they were forcing
Encoding.external_encoding to be ASCII-8BIT (introduced in #1211). So the
error would only be seen in wild usage (see issue #353).

This commit forces ASCII-8BIT on File.read calls. The error is still
present if using memory blobs with other encodings.
lildude added a commit that referenced this pull request Mar 19, 2020
* Prepare 7.9.0 release

* Put back the v7.8.0 version

We need this to ensure the versioning used during testing on GitHub.com doesn't cause caching problems in future

* fix errors on non-UTF-8 encodings

Some files failed with "invalid byte sequence in UTF-8 (ArgumentError)"
when BlobHelper#lines was called. Some problematic files include
UTF-16LE samples such as test/fixtures/Data/utf16le.

Errors were not present when computing stats from git repositories,
since git blobs are always read as ASCII-8BIT and that was working
correctly. However, when using FileBlob, encoding could be ASCII-8BIT,
UTF-8 or other, depending on the runtime value of Encoding.external_encoding.

Tests were not catching the error since they were forcing
Encoding.external_encoding to be ASCII-8BIT (introduced in #1211). So the
error would only be seen in wild usage (see issue #353).

This commit forces ASCII-8BIT on File.read calls. The error is still
present if using memory blobs with other encodings.

* Decrease expected error count

* Set version to 7.9.0

Co-authored-by: Rick Winfrey <[email protected]>
Co-authored-by: Santiago M. Mola <[email protected]>
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 18, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants